Data-intensive problems are now so prevalent that organizations across many industries face them in their business operations. It is often crucial for enterprises to be able to analyze large volumes of data effectively and in a timely manner. MapReduce and its open-source implementation, Hadoop, dramatically simplified the development of parallel data-intensive computing applications for ordinary users, and the combination of Hadoop and cloud computing has made large-scale parallel data-intensive computing far more accessible to potential users than ever before. Although Hadoop has become the most popular data management framework for parallel data-intensive computing in the cloud, the Hadoop scheduler is not a perfect match for cloud environments. In this paper, we discuss the issues with the Hadoop task assignment scheme and present an improved scheme for heterogeneous computing environments such as public clouds. The proposed scheme is based on an optimal minimum makespan algorithm. It projects and compares the completion times of every task slot's next data block, and explicitly strives to shorten the completion time of the map phase of MapReduce jobs. We conducted extensive simulations to evaluate the performance of the proposed scheme against the Hadoop scheme in two types of heterogeneous computing environments that are typical of public cloud platforms. The simulation results showed that the proposed scheme could remarkably reduce the map phase completion time, and that it could reduce the amount of remote processing to a more significant extent, which makes data processing less vulnerable to both network congestion and disk contention.
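To make the idea of projecting and comparing completion times concrete, the following is a minimal illustrative sketch, not the scheme evaluated in this paper. It uses a simple greedy rule (assign next whichever slot and block pair would finish earliest), which is not guaranteed to be optimal. The names `Slot`, `REMOTE_PENALTY`, `projected_finish`, and `assign_blocks`, as well as the 1.5x slowdown assumed for remote blocks, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Assumed slowdown factor when a slot processes a block that is not stored locally.
REMOTE_PENALTY = 1.5


@dataclass
class Slot:
    name: str
    secs_per_block: float                           # assumed time to process one local block
    local_blocks: set = field(default_factory=set)  # ids of blocks stored on this slot's node
    busy_until: float = 0.0                         # time at which the slot becomes free


def projected_finish(slot, block):
    """Project when `slot` would finish `block` if it were assigned next."""
    cost = slot.secs_per_block
    if block not in slot.local_blocks:
        cost *= REMOTE_PENALTY
    return slot.busy_until + cost


def assign_blocks(slots, blocks):
    """Greedy sketch: repeatedly pick the (slot, block) pair that finishes earliest."""
    remaining = set(blocks)
    plan = []
    while remaining:
        finish, slot, block = min(
            ((projected_finish(s, b), s, b) for s in slots for b in remaining),
            # Break ties by slot name, then block id, so the result is deterministic.
            key=lambda t: (t[0], t[1].name, t[2]),
        )
        slot.busy_until = finish
        remaining.remove(block)
        plan.append((block, slot.name, finish))
    makespan = max(s.busy_until for s in slots)
    return plan, makespan


if __name__ == "__main__":
    demo_slots = [Slot("fast", 1.0, {0, 1}), Slot("slow", 3.0, {2, 3})]
    demo_plan, demo_makespan = assign_blocks(demo_slots, [0, 1, 2, 3])
    for block, slot_name, finish in demo_plan:
        print(f"block {block} -> {slot_name}, finishes at {finish}")
    print("map phase makespan:", demo_makespan)
```

In this toy setting the fast slot takes one of the slow slot's local blocks remotely, because its projected completion time is still earlier than the slow slot's. Under these assumed numbers, a purely locality-driven assignment would instead leave both of those blocks on the slow slot and finish later.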